perf(crypto): specialize keccak256 for the fixed-shape Merkle parent hash #774

Open
Oppen wants to merge 2 commits into perf/rkyv-serialization from perf/merkle-block-keccak

Oppen (Collaborator) commented Jul 3, 2026

Summary

  • Specializes hash_new_parent (Merkle parent hash — always exactly 64 bytes, two 32-byte nodes) for Keccak256/32-byte nodes to a single hand-rolled keccak-f[1600] permutation, bypassing sha3's generic incremental-Digest/block_buffer machinery. Dispatch is a TypeId-based compile-time-constant check (D == Keccak256 && NUM_BYTES == 32); every other digest/size falls through unchanged to the original generic path.
  • Also fast-paths FieldElementPairBackend::hash_data (always exactly 2 small field elements, always sub-rate) the same way. Note: this backend is used prover-side only (FriLayerMerkleTreeBackend) — the guest-side cycle win below comes entirely from the hash_new_parent change.
  • Deliberately does not fast-path FieldElementVectorBackend::hash_data (wide trace-row leaves). A "try one block, else fall back" attempt measured as a net regression: real rows always exceed the 136-byte keccak rate, so the attempt never succeeds and only adds overhead. This backend stays on the generic path, with a comment explaining why.
  • Tier 1 of the keccak-cost-reduction work scoped in .refs/merkle-keccak_handoff.md; tier 2 (routing the permutation through the VM's keccak precompile) is explicitly out of scope here — it needs its own outer-proving-cost measurement gate.
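To make the single-permutation idea concrete, here is a minimal sketch of hashing two 32-byte nodes with one keccak-f[1600] call. It is not the code from this diff: the names `keccak_f1600`, `load_lane`, and `keccak256_parent` are hypothetical, and the permutation uses a compact textbook formulation rather than whatever layout the PR actually ships. It assumes original Keccak padding (domain byte 0x01, final bit 0x80) and the 136-byte rate of Keccak-256.

```rust
/// Round constants for the iota step of keccak-f[1600].
const RC: [u64; 24] = [
    0x0000000000000001, 0x0000000000008082, 0x800000000000808a,
    0x8000000080008000, 0x000000000000808b, 0x0000000080000001,
    0x8000000080008081, 0x8000000000008009, 0x000000000000008a,
    0x0000000000000088, 0x0000000080008009, 0x000000008000000a,
    0x000000008000808b, 0x800000000000008b, 0x8000000000008089,
    0x8000000000008003, 0x8000000000008002, 0x8000000000000080,
    0x000000000000800a, 0x800000008000000a, 0x8000000080008081,
    0x8000000000008080, 0x0000000080000001, 0x8000000080008008,
];
/// Rotation offsets and lane order for the combined rho/pi walk.
const ROTC: [u32; 24] = [
    1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14, 27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44,
];
const PILN: [usize; 24] = [
    10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4, 15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1,
];

/// The 24-round keccak-f[1600] permutation over 25 little-endian lanes.
fn keccak_f1600(st: &mut [u64; 25]) {
    for rc in RC.iter() {
        // theta: mix column parities into every lane
        let mut bc = [0u64; 5];
        for i in 0..5 {
            bc[i] = st[i] ^ st[i + 5] ^ st[i + 10] ^ st[i + 15] ^ st[i + 20];
        }
        for i in 0..5 {
            let t = bc[(i + 4) % 5] ^ bc[(i + 1) % 5].rotate_left(1);
            for j in (0..25).step_by(5) {
                st[j + i] ^= t;
            }
        }
        // rho + pi: rotate each lane and move it to its new position
        let mut t = st[1];
        for i in 0..24 {
            let j = PILN[i];
            let next = st[j];
            st[j] = t.rotate_left(ROTC[i]);
            t = next;
        }
        // chi: non-linear step applied row by row
        for j in (0..25).step_by(5) {
            let row = [st[j], st[j + 1], st[j + 2], st[j + 3], st[j + 4]];
            for i in 0..5 {
                st[j + i] = row[i] ^ (!row[(i + 1) % 5] & row[(i + 2) % 5]);
            }
        }
        // iota: break symmetry with the round constant
        st[0] ^= *rc;
    }
}

/// Reads one 8-byte little-endian lane.
fn load_lane(bytes: &[u8]) -> u64 {
    let mut lane = [0u8; 8];
    lane.copy_from_slice(bytes);
    u64::from_le_bytes(lane)
}

/// Keccak-256 of `left || right` (64 bytes) using a single permutation.
fn keccak256_parent(left: &[u8; 32], right: &[u8; 32]) -> [u8; 32] {
    let mut st = [0u64; 25];
    for i in 0..4 {
        st[i] = load_lane(&left[8 * i..8 * i + 8]);
        st[i + 4] = load_lane(&right[8 * i..8 * i + 8]);
    }
    st[8] ^= 0x01; // padding start: byte 64 is the low byte of lane 8
    st[16] ^= 0x80u64 << 56; // padding end: byte 135 is the high byte of lane 16
    keccak_f1600(&mut st);
    let mut out = [0u8; 32];
    for i in 0..4 {
        out[8 * i..8 * i + 8].copy_from_slice(&st[i].to_le_bytes());
    }
    out
}
```

Because the 64-byte message plus padding fits inside one 136-byte block, the state can be built directly from the two nodes with no intermediate buffer, which is the property the fast path relies on.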

Measurements (recursion verifier guest, cycle-accurate)

| | single-query | multi-query |
|---|---:|---:|
| baseline (perf/rkyv-serialization) | 89,721,844 | 2,210,366,539 |
| this PR | 89,081,698 | 2,152,039,894 |
| delta | −640,146 (−0.7%) | −58,326,645 (−2.6%) |

In-VM correctness verified at every step: the guest's committed 32-byte output digest is byte-identical to baseline across all intermediate versions of this change.

Review

Went through two rounds of adversarial review (performance, correctness, cryptographic soundness, implementation simplicity) against this exact diff. First round surfaced real issues (parent-hash fast path was doing a redundant buffer copy and wasn't being inlined across the crate boundary; test coverage only pinned the inner permutation helper, not the dispatch condition itself) — all fixed and re-measured. Second round came back clean: no correctness or soundness defects, byte-for-byte lane/padding arithmetic independently verified against the Keccak spec, prover/verifier path-selection symmetry confirmed, fail-safe TypeId-mismatch fallback re-verified.
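As a rough illustration of the fail-safe dispatch discussed in review, the sketch below shows a TypeId comparison that selects the fast path only for one exact digest type and node size, and otherwise falls back. The marker types `Keccak256` and `OtherDigest`, the `Path` enum, and `select_path` are hypothetical stand-ins, not the crate's real types. The assumption is that after monomorphization both sides of the comparison are constants, so the branch is expected to fold away.

```rust
use std::any::TypeId;

// Hypothetical marker types standing in for real digest implementations.
struct Keccak256;
struct OtherDigest;

/// Which hashing route a given instantiation would take.
#[derive(Debug, PartialEq)]
enum Path {
    SinglePermutation,
    GenericDigest,
}

/// Picks the fast path only when the digest type and node size both match.
/// Any mismatch lands on the generic route, so a wrong guess costs speed,
/// not correctness.
#[inline]
fn select_path<D: 'static, const NUM_BYTES: usize>() -> Path {
    if TypeId::of::<D>() == TypeId::of::<Keccak256>() && NUM_BYTES == 32 {
        Path::SinglePermutation
    } else {
        Path::GenericDigest
    }
}
```

Under these assumptions, `select_path::<Keccak256, 32>()` is the only instantiation that takes the specialized route; a different digest or a different node size keeps the original behavior.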

Test plan

  • cargo test --workspace --exclude math-cuda (math-cuda needs a GPU, not available here)
  • make test-ethrex
  • cargo test -p crypto -p stark
  • New unit tests: byte-identity of the hand-rolled permutation against sha3::Keccak256 (1000 random 64-byte pairs, all sub-rate lengths 0..135), and dispatch-level pinning tests comparing the trait-method output (not just the inner helper) against an independent reference
  • cargo test -p lambda-vm-prover --lib test_recursion_execute_1query -- --ignored --nocapture — in-VM verify accepted, output digest unchanged from baseline
  • make test-profile-recursion-single / -multi — cycle counts above

Oppen added 2 commits July 3, 2026 15:17
hash_new_parent always hashes exactly 64 bytes (two 32-byte nodes), which
fits the keccak rate in one block. Skip sha3's block_buffer/Digest
machinery and run keccak-f[1600] directly for the Keccak256/32-byte case;
fall back to the generic Digest path for every other backend
instantiation. Cuts ~23M cycles (~1%) off the recursion verifier guest's
multi-query profile.
Address adversarial-review findings on the previous commit:

- hash_new_parent's fast path now builds the keccak state directly from
  left/right (no intermediate 136-byte buffer, no owned-array copies) and
  is #[inline], so it collapses into callers instead of costing a real
  cross-crate call plus redundant copies.
- FieldElementPairBackend::hash_data (always 2 small elements, always
  single-block) gets the same single-permutation treatment.
- FieldElementVectorBackend::hash_data deliberately keeps the plain
  multi-block Digest path: its inputs are whole trace rows, which always
  exceed the keccak rate, so a "try one block, else fall back" attempt
  measured as a net regression (all cost, no payoff) rather than a win.
- Added tests pinning the TypeId dispatch itself (not just the inner
  permutation helpers) against an independent sha3 reference.

Cuts guest cycles further: single-query 89,632,723 -> 89,081,698,
multi-query 2,187,109,919 -> 2,152,039,894 (vs. original baseline
89,721,844 / 2,210,366,539 -- roughly -0.7% / -2.6% total).